In [1]:
from IPython.display import Image
In [2]:
Image("/Users/mayurjain/Desktop/cv_images/Decision_tree.png")
Out[2]:
Let's consider the plot of Test vs. Grades for student admission (S = student, x = test, y = grades).
First, we split the plot at x = 5 with a vertical line, since that captures the major chunk of positive results, leaving 5 false positives on the bottom right and 5 false negatives on the top left. Simultaneously, check the tree on the right: for values less than 5 the student is rejected, and for values greater than 5 the student is accepted.
Next, we make a horizontal cut on the top left at y = 7, i.e. above 7 the student is accepted and below 7 the student is rejected.
Next, we make a horizontal cut on the bottom right at y = 2, i.e. above 2 the student is accepted and below 2 the student is rejected.
Thus, we can build the decision tree shown on the right.
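As a rough sketch of the same idea in code, the cell below fits a DecisionTreeClassifier from scikit-learn on a small made-up set of (test, grades) points; the thresholds it learns are whatever the algorithm picks for this toy data, not necessarily the x = 5, y = 7, y = 2 cuts from the figure.
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical (test, grades) points; 1 = accepted, 0 = rejected
X = np.array([[2, 8], [3, 9], [1, 3], [4, 2], [6, 6], [7, 1], [8, 8], [9, 3]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Print the learned splits (feature thresholds) as text
print(export_text(tree, feature_names=["test", "grades"]))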
Consider the three states of matter: solid, liquid and gas. Each state consists of particles that are free to move. A solid has the lowest freedom to move, compared to a liquid and a gas, and a liquid has lower freedom than a gas. Entropy measures how much freedom a particle has to move around.
So the entropy of a solid is low, medium for a liquid and high for a gas. Entropy can also be associated with probability. For instance, suppose we have 3 buckets with 4 balls in each bucket, as seen below.
In [4]:
Image("/Users/mayurjain/Desktop/cv_images/Entropy.png")
Out[4]:
RB - Red Ball and BB - Blue Ball
Now, the 1st bucket has 4 RB, the 2nd bucket has 3 RB and 1 BB, and the 3rd bucket has 2 RB and 2 BB. By analogy with the solid state, the 1st bucket has maximum rigidity/minimum freedom: with 4 RB the outcome is fixed, since there is no alternative arrangement. The probability of drawing back its exact arrangement is 1.
As we move to the 2nd bucket, we have medium rigidity/freedom, since an RB is more likely to be drawn than a BB. The probability of drawing back its exact arrangement (3 RB then 1 BB) is (3/4)^3 * (1/4) ≈ 0.105.
In the 3rd bucket, we have very low rigidity and high freedom, since RB and BB are equally likely. The probability of drawing back its exact arrangement (2 RB and 2 BB) is (1/2)^4 = 0.0625.
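A quick sanity check of those numbers, assuming we draw the balls with replacement and ask for the exact original arrangement:
In [ ]:
# Probability of reproducing each bucket's arrangement, drawing with replacement
p_bucket1 = (4/4) ** 4                # 4 red          -> 1.0
p_bucket2 = (3/4) ** 3 * (1/4)        # 3 red, 1 blue  -> ~0.105
p_bucket3 = (2/4) ** 2 * (2/4) ** 2   # 2 red, 2 blue  -> 0.0625
print(p_bucket1, p_bucket2, p_bucket3)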
In [5]:
Image("/users/mayurjain/Desktop/cv_images/entropy_log.png")
Out[5]:
If we had a thousand balls, the product of these probabilities would be an extremely small number. So instead we take the -log of each probability and average them, which turns the product into a positive value: the entropy. Thus the 3rd bucket has the maximum entropy, which means it has the most freedom among the three combinations.
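Written out, that averaging of -log probabilities is the usual entropy formula, Entropy = -Σ p_i log2(p_i), where p_i is the proportion of each colour in the bucket. A minimal sketch computing it for the three buckets:
In [ ]:
import numpy as np

def entropy(probabilities):
    """Entropy = -sum(p * log2(p)), ignoring zero-probability outcomes."""
    p = np.array(probabilities)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([4/4]))        # bucket 1: 0.0   (lowest entropy)
print(entropy([3/4, 1/4]))   # bucket 2: ~0.811
print(entropy([2/4, 2/4]))   # bucket 3: 1.0   (highest entropy)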
In [6]:
Image("/users/mayurjain/Desktop/cv_images/info_gain_1.png")
Out[6]:
In [7]:
Image("/users/mayurjain/Desktop/cv_images/info_gain_2.png")
Out[7]:
In [9]:
Image("/Users/mayurjain/Desktop/cv_images/IG_formula.png")
Out[9]:
The decision tree calculates the entropy for each candidate split and, from it, the information gain: the entropy of the parent node minus the weighted average entropy of the child nodes. The higher the information gain, the better the split. For instance, in the above image on the left, IG = 0, because even after the split there is no clear distinction between the data points. On the right, IG = 1, because the split from parent to child nodes gives a clear distinction between the data points.
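A sketch of that calculation, assuming information gain is defined as the parent's entropy minus the size-weighted average entropy of the children (the label lists here are made up to mirror the two pictured splits):
In [ ]:
import numpy as np

def label_entropy(labels):
    """Entropy of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = entropy(parent) - weighted average entropy of the children."""
    total = sum(len(c) for c in children)
    weighted = sum(len(c) / total * label_entropy(c) for c in children)
    return label_entropy(parent) - weighted

# Split that does not separate the classes -> IG = 0
print(information_gain([0, 0, 1, 1], [[0, 1], [0, 1]]))
# Split that separates the classes perfectly -> IG = 1
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))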
If there are hundreds of columns from which the decision tree is to be built, the tree tends to overfit; it is like making many horizontal and vertical cuts on the plot to classify every single data point. To overcome this challenge, we go for a random forest.
In [2]:
Image("/Users/mayurjain/Desktop/cv_images/random_forest.png")
Out[2]:
To build a random forest, we select a random subset of columns from the full set and build a decision tree on that subset. Similarly, we build multiple decision trees, each on a different subset of columns, and predict the label with each one. At the end, we take a vote over the predicted labels. In the above image, we have three decision trees whose outputs are Snapchat, WhatsApp and WhatsApp, so by voting we get WhatsApp as the output.
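A minimal sketch with scikit-learn's RandomForestClassifier, which does the column/sample subsetting and the voting internally; the data here is hypothetical:
In [ ]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: 200 samples, 20 columns, binary label
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# 3 trees, each grown on a random subset of columns; prediction is a majority vote
forest = RandomForestClassifier(n_estimators=3, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))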
Maximum Depth
The maximum depth of a decision tree is simply the largest possible length from the root to a leaf.
A tree of maximum depth k can have at most 2^k leaves.
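For example, a tree limited to max_depth=3 can have at most 2^3 = 8 leaves. A quick check on made-up data:
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int) ^ (X[:, 1] > 0.5).astype(int)  # hypothetical labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.get_n_leaves(), "leaves, at most 2**3 =", 2 ** 3)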
Minimum number of samples to split
A node must have at least min_samples_split samples in order to be large enough to split. If a node has fewer than min_samples_split samples, it will not be split, and the splitting process stops at that node.
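For instance, with noisy made-up labels an unconstrained tree keeps splitting until its leaves are pure, while min_samples_split=20 stops it earlier:
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
y[rng.rand(100) < 0.2] ^= 1  # label noise, so an unconstrained tree overfits

deep = DecisionTreeClassifier().fit(X, y)
shallow = DecisionTreeClassifier(min_samples_split=20).fit(X, y)
print(deep.get_n_leaves(), shallow.get_n_leaves())  # the constrained tree has fewer leaves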
Minimum number of samples per leaf
When splitting a node, one could run into the problem of having 99 samples in one of the children and 1 in the other. This will not take us very far, and would be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of samples we allow in each leaf.
This number can be specified as an integer or as a float. If it's an integer, it's the minimum number of samples allowed in a leaf. If it's a float, it's the minimum fraction of samples allowed in a leaf. For example, 0.1, or 10%, implies that a particular split will not be allowed if one of the resulting leaves contains fewer than 10% of the samples in the dataset.
If a threshold on a feature results in a leaf that has fewer samples than min_samples_leaf, the algorithm will not allow that split, but it may perform a split on the same feature at a different threshold that does satisfy min_samples_leaf.
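A small sketch of the integer vs. float forms; with 100 samples, min_samples_leaf=0.1 means every leaf must hold at least 10 samples, so the two settings below impose the same constraint:
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(2)
X = rng.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

tree_int = DecisionTreeClassifier(min_samples_leaf=10).fit(X, y)    # at least 10 samples per leaf
tree_frac = DecisionTreeClassifier(min_samples_leaf=0.1).fit(X, y)  # at least 10% of the samples per leaf
print(tree_int.get_n_leaves(), tree_frac.get_n_leaves())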
In [ ]: